Statistical Science 

2007, Vol. 22, No. 4, 477-505 

DOI: 10.1214/07-STS242 

© Institute of Mathematical Statistics, 2007 



Boosting Algorithms: Regularization, 
Prediction and Model Fitting 



Peter Buhlmann and Torsten Hothorn 



oo 

o 

O 
(N 

< 



m 



H— > 



> 

(N 

in 
(N 

O 
OO 

o 



% 



Abstract. We present a statistical perspective on boosting. Special 
emphasis is given to estimating potentially complex parametric or non- 
parametric models, including generalized linear and additive models as 
well as regression models for survival analysis. Concepts of degrees 
of freedom and corresponding Akaike or Bayesian information crite- 
ria, particularly useful for regularization and variable selection in high- 
dimensional covariate spaces, are discussed as well. 

The practical aspects of boosting procedures for fitting statistical 
models are illustrated by means of the dedicated open-source software 
package mboost. This package implements functions which can be used 
for model fitting, prediction and variable selection. It is flexible, al- 
lowing for the implementation of new boosting algorithms optimizing 
user-specified loss functions. 

Key words and phrases: Generalized linear models, generalized ad- 
ditive models, gradient boosting, survival analysis, variable selection, 
software. 



1. INTRODUCTION 

Freund and Schapire's AdaBoost algorithm for clas- 
sification (author?) [29, 30, 31] has attracted much 
attention in the machine learning community (cf. 
[76] , and the references therein) as well as in related 
areas in statistics (author?) [15, 16, 33]. Various ver- 
sions of the AdaBoost algorithm have proven to be 
very competitive in terms of prediction accuracy in a 
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variety of applications. Boosting methods have been 
originally proposed as ensemble methods (see Sec- 
tion 1.1), which rely on the principle of generating 
multiple predictions and majority voting (averag- 
ing) among the individual classifiers. 

Later, Breiman (author?) [15, 16] made a path- 
breaking observation that the AdaBoost algorithm 
can be viewed as a gradient descent algorithm in 
function space, inspired by numerical optimization 
and statistical estimation. Moreover, Friedman, Hastie 
and Tibshirani (author?) [33] laid out further im- 
portant foundations which linked Ada-Boost and 
other boosting algorithms to the framework of sta- 
tistical estimation and additive basis expansion. In 
their terminology, boosting is represented as "stage- 
wise, additive modeling": the word "additive" does 
not imply a model fit which is additive in the co- 
variates (see our Section 4), but refers to the fact 
that boosting is an additive (in fact, a linear) combi- 
nation of "simple" (function) estimators. Also Ma- 
son et al. (author?) [62] and Ratsch, Onoda and 
Midler (author?) [70] developed related ideas which 
were mainly acknowledged in the machine learning 
community. In Hastie, Tibshirani and Friedman (au- 
thor?) [42], additional views on boosting are given; 



P. BUHLMANN AND T. HOTHORN 



in particular, the authors first pointed out the re- 
lation between boosting and ^-penalized estima- 
tion. The insights of Friedman, Hastie and Tibshi- 
rani (author?) [33] opened new perspectives, namely 
to use boosting methods in many other contexts 
than classification. We mention here boosting meth- 
ods for regression (including generalized regression) 
[22, 32, 71], for density estimation [73], for survival 
analysis [45, 71] or for multivariate analysis [33, 59]. 
In quite a few of these proposals, boosting is not only 
a black-box prediction tool but also an estimation 
method for models with a specific structure such as 
linearity or additivity [18, 22, 45]. Boosting can then 
be seen as an interesting regularization scheme for 
estimating a model. This statistical perspective will 
drive the focus of our exposition of boosting. 

We present here some coherent explanations and 
illustrations of concepts about boosting, some deriva- 
tions which are novel, and we aim to increase the 
understanding of some methods and some selected 
known results. Besides giving an overview on theo- 
retical concepts of boosting as an algorithm for fit- 
ting statistical models, we look at the methodology 
from a practical point of view as well. The dedicated 
add-on package mboost ("model-based boosting," 
[43]) to the R system for statistical computing [69] 
implements computational tools which enable the 
data analyst to compute on the theoretical concepts 
explained in this paper as closely as possible. The 
illustrations presented throughout the paper focus 
on three regression problems with continuous, bi- 
nary and censored response variables, some of them 
having a large number of covariates. For each ex- 
ample, we only present the most important steps of 
the analysis. The complete analysis is contained in 
a vignette as part of the mboost package (see Ap- 
pendix A.l) so that every result shown in this paper 
is reproducible. 

Unless stated differently, we assume that the data 
are realizations of random variables 

(X 1 ,Y l ),...,(X n ,Y n ) 

from a stationary process with p-dimensional predic- 
tor variables X{ and one-dimensional response vari- 
ables Yi] for the case of multivariate responses, some 
references are given in Section 9.1. In particular, 
the setting above includes independent, identically 
distributed (i.i.d.) observations. The generalization 
to stationary processes is fairly straightforward: the 
methods and algorithms are the same as in the i.i.d. 
framework, but the mathematical theory requires 



more elaborate techniques. Essentially, one needs to 
ensure that some (uniform) laws of large numbers 
still hold, for example, assuming stationary, mixing 
sequences; some rigorous results are given in [57] and 
[59]. 

1.1 Ensemble Schemes: Multiple Prediction and 
Aggregation 

Ensemble schemes construct multiple function es- 
timates or predictions from reweighted data and use 
a linear (or sometimes convex) combination thereof 
for producing the final, aggregated estimator or pre- 
diction. 

First, we specify a base procedure which constructs 
a function estimate g(-) with values in K, based on 
some data (X\,Y\), . . . , (X n ,Y n ): 



(X 1 ,Y 1 ),...,(X n ,Y n ) 



base procedure 



$(•)• 



For example, a very popular base procedure is a re- 
gression tree. 

Then, generating an ensemble from the base pro- 
cedures, that is, an ensemble of function estimates 
or predictions, works generally as follows: 

base procedure 



reweighted data 1 
reweighted data 2 

reweighted data M 







[!](.) 



base procedure 



base procedure 



M 



9 [2] (-) 



)Wl 



aggregation: /a(-) = E a m g [mi (-). 

m=l 

What is termed here as "reweighted data" means 
that we assign individual data weights to each of 
the n sample points. We have also implicitly as- 
sumed that the base procedure allows to do some 
weighted fitting, that is, estimation is based on a 
weighted sample. Throughout the paper (except in 
Section 1.2), we assume that a base procedure esti- 
mate <?(•) is real- valued (i.e., a regression procedure), 
making it more adequate for the "statistical perspec- 
tive" on boosting, in particular for the generic FGD 
algorithm in Section 2.1. 

The above description of an ensemble scheme is 
too general to be of any direct use. The specification 
of the data reweighting mechanism as well as the 
form of the linear combination coefficients {ce m }^ =1 
are crucial, and various choices characterize differ- 
ent ensemble schemes. Most boosting methods are 
special kinds of sequential ensemble schemes, where 
the data weights in iteration m depend on the results 
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from the previous iteration m — 1 only (memoryless 
with respect to iterations m — 2, m — 3, . . .). Exam- 
ples of other ensemble schemes include bagging [14] 
or random forests [1, 17]. 

1.2 AdaBoost 

The AdaBoost algorithm for binary classification 
[31] is the most well-known boosting algorithm. The 
base procedure is a classifier with values in {0, 1} 
(slightly different from a real-valued function esti- 
mator as assumed above), for example, a classifica- 
tion tree. 

AdaBoost algorithm 

1. Initialize some weights for individual sample 
points: w\ = l/n for i = 1, . . . ,n. Set m = 0. 

2. Increase m by 1. Fit the base procedure to the 
weighted data, that is, do a weighted fitting using 
the weights wf 1 , yielding the classifier g^ m \-)- 

3. Compute the weighted in-sample misclassifica- 
tion rate 



errM = £ w ^I(Y t ± $M (*))/£ w\ 



[m-l] 



i=l 



i=l 



C*™ = lOB 



1 — err ; 



em 



and update the weights 

Wi = w [ r 1] exp(aH /( y. + gW( Xi ))), 



\ rn \ ~ i x~^ ~ 



4. Iterate steps 2 and 3 until m = m stop and build 
the aggregated classifier by weighted majority vot- 



ing: 



■mstop 



/AdaBoost 0) = argmax z 

</e{o,i} m=1 



^ cJ m 



I{gW(x)=y). 



By using the terminology m st0 p (instead of M as 
in the general description of ensemble schemes), we 
emphasize here and later that the iteration process 
should be stopped to avoid overfitting. It is a tuning 
parameter of AdaBoost which may be selected using 
some cross-validation scheme. 

1.3 Slow Overfitting Behavior 

It had been debated until about the year 2000 
whether the AdaBoost algorithm is immune to over- 
fitting when running more iterations, that is, stop- 
ping would not be necessary. It is clear nowadays 



that Ada-Boost and also other boosting algorithms 
are overfitting eventually, and early stopping [using 
a value of m stop before convergence of the surrogate 
loss function, given in (3.3), takes place] is necessary 
[7, 51, 64]. We emphasize that this is not in con- 
tradiction to the experimental results by (author?) 
[15] where the test set misclassification error still 
decreases after the training misclassification error is 
zero [because the training error of the surrogate loss 
function in (3.3) is not zero before numerical con- 
vergence] . 

Nevertheless, the AdaBoost algorithm is quite re- 
sistant to overfitting (slow overfitting behavior) when 
increasing the number of iterations m s t p- This has 
been observed empirically, although some cases with 
clear overfitting do occur for some datasets [64]. A 
stream of work has been devoted to develop VC-type 
bounds for the generalization (out-of-sample) error 
to explain why boosting is overfitting very slowly 
only. Schapire et al. (author?) [77] proved a remark- 
able bound for the generalization misclassification 
error for classifiers in the convex hull of a base proce- 
dure. This bound for the misclassification error has 
been improved by Koltchinskii and Panchenko (au- 
thor?) [53], deriving also a generalization bound for 
AdaBoost which depends on the number of boosting 
iterations. 

It has been argued in [33], rejoinder, and [21] that 
the overfitting resistance (slow overfitting behav- 
ior) is much stronger for the misclassification error 
than many other loss functions such as the (out-of- 
sample) negative log-likelihood (e.g., squared error 
in Gaussian regression). Thus, boosting's resistance 
of overfitting is coupled with a general fact that over- 
fitting is less an issue for classification (i.e., the 0-1 
loss function). Furthermore, it is proved in [6] that 
the misclassification risk can be bounded by the risk 
of the surrogate loss function: it demonstrates from 
a different perspective that the 0-1 loss can exhibit 
quite a different behavior than the surrogate loss. 

Finally, Section 5.1 develops the variance and bias 
for boosting when utilized to fit a one-dimensional 
curve. Figure 5 illustrates the difference between 
the boosting and the smoothing spline approach, 
and the eigen-analysis of the boosting method [see 
(5.2)] yields the following: boosting's variance in- 
creases with exponentially small increments while 
its squared bias decreases exponentially fast as the 
number of iterations grows. This also explains why 
boosting's overfitting kicks in very slowly. 



P. BUHLMANN AND T. HOTHORN 



1.4 Historical Remarks 

The idea of boosting as an ensemble method for 
improving the predictive performance of a base pro- 
cedure seems to have its roots in machine learning. 
Kearns and Valiant (author?) [52] proved that if in- 
dividual classifiers perform at least slightly better 
than guessing at random, their predictions can be 
combined and averaged, yielding much better pre- 
dictions. Later, Schapire (author?) [75] proposed a 
boosting algorithm with provable polynomial run- 
time to construct such a better ensemble of classi- 
fiers. The AdaBoost algorithm [29, 30, 31] is consid- 
ered as a first path-breaking step toward practically 
feasible boosting algorithms. 

The results from Breiman (author?) [15, 16], show- 
ing that boosting can be interpreted as a functional 
gradient descent algorithm, uncover older roots of 
boosting. In the context of regression, there is an 
immediate connection to the Gauss-Southwell al- 
gorithm [79] for solving a linear system of equa- 
tions (see Section 4.1) and to Tukey's [83] method 
of "twicing" (see Section 5.1). 

2. FUNCTIONAL GRADIENT DESCENT 

Breiman (author?) [15, 16] showed that the Ad- 
aBoost algorithm can be represented as a steep- 
est descent algorithm in function space which we 
call functional gradient descent (FGD). Friedman, 
Hastie and Tibshirani (author?) [33] and Friedman 
(author?) [32] then developed a more general, statis- 
tical framework which yields a direct interpretation 
of boosting as a method for function estimation. In 
their terminology, it is a "stagewise, additive mod- 
eling" approach (but the word "additive" does not 
imply a model fit which is additive in the covariates; 
see Section 4). Consider the problem of estimating 
a real- valued function 

(2.1) /*(•) = argminE[p(y,/(X))], 

/(■) 

where p(-, •) is a loss function which is typically as- 
sumed to be differentiable and convex with respect 
to the second argument. For example, the squared 
error loss p(y,f) = \y — f\ 2 yields the well-known 
population minimizer f*(x) =E\Y\X = x\. 

2.1 The Generic FGD or Boosting Algorithm 

In the sequel, FGD and boosting are used as equiv- 
alent terminology for the same method or algorithm. 

Estimation of /*(•) in (2.1) with boosting can be 
done by considering the empirical risk n _1 Y^i=i p(Xii 



f{Xi)) and pursuing iterative steepest descent in 
function space. The following algorithm has been 
given by Friedman (author?) [32]: 

Generic FGD algorithm 

1. Initialize p°'(-) with an offset value. Common 
choices are 

n 

p'(-) = argminn - YJ p(Yj,c) 
i=\ 

or / [01 (-) = 0. Set m = 0. 

2. Increase m by 1. Compute the negative gradient 
—§jp{YJ) and evaluate at f m -^(Xi): 



U; 



d_ 
~df 



P\Xh /)l/=/[m- l](Xi)> * 



1, ... ,71. 



3. Fit the negative gradient vector U\,...,U n to 
X\, . . . , X n by the real- valued base procedure (e.g., 
regression) 



(Xi,Ui 



base procedure 



i)i=\ 



9 l 



Thus, g\ m \-) can be viewed as an approximation 
of the negative gradient vector. 

4. Update /H (■) =/ [m ~ 1] (0 + v- g [m] (-), where 0< 
v < 1 is a step-length factor (see below), that is, 
proceed along an estimate of the negative gradi- 
ent vector. 

5. Iterate steps 2 to 4 until m = m stop for some stop- 
ping iteration TO st0 p- 

The stopping iteration, which is the main tuning 
parameter, can be determined via cross-validation 
or some information criterion; see Section 5.4. The 
choice of the step-length factor v in step 4 is of minor 
importance, as long as it is "small," such as v = 0.1. 
A smaller value of v typically requires a larger num- 
ber of boosting iterations and thus more computing 
time, while the predictive accuracy has been em- 
pirically found to be potentially better and almost 
never worse when choosing v "sufficiently small" 
(e.g., v = 0.1) [32]. Friedman (author?) [32] suggests 
to use an additional line search between steps 3 and 
4 (in case of other loss functions p(-,-) than squared 
error): it yields a slightly different algorithm but the 
additional line search seems unnecessary for achiev- 
ing a good estimator fi mst °p\ . The latter statement is 
based on empirical evidence and some mathematical 
reasoning as described at the beginning of Section 7. 
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2.1.1 Alternative formulation in function space. 
In steps 2 and 3 of the generic FGD algorithm, we 
associated with U± , . . . , U n a negative gradient vec- 
tor. A reason for this can be seen from the following 
formulation in function space which is similar to the 
exposition in Mason et al. (author?) [62] and to the 
discussion in Ridgeway (author?) [72]. 

Consider the empirical risk functional C(f) = 
n~ J2i=i p{Yi-> f (Xi)) and the usual inner product 
(/, g) = n~ l Yh=i f(Xi)g(Xi). We can then calculate 
the negative Gateaux derivative dC(-) of the func- 
tional C(-), 



dC(f)(x) 



d 
~K-C(f + a6 x )\ a =o, 

f:W^R. 



x<E 



where 5 X denotes the delta- (or indicator-) function 
at x £ R p . In particular, when evaluating the deriva- 
tive -dC at /I" 1 - 1 ! and X h we get 

- dC (fl™-i])(X i ) = n- l U i , 

with Ui,...,U n exactly as in steps 2 and 3 of the 
generic FGD algorithm. Thus, the negative gradient 
vector U\ , . . . , U n can be interpreted as a functional 
(Gateaux) derivative evaluated at the data points. 
We point out that the algorithm in Mason et al. 
(author?) [62] is different from the generic FGD 
method above: while the latter is fitting the nega- 
tive gradient vector by the base procedure, typically 
using (nonparametric) least squares, Mason et al. 
(author?) [62] fit the base procedure by maximizing 



— (U,g) = n~ l Y^i=iUig(Xi). For certain base pro- 
cedures, the two algorithms coincide. For example, 
if g(-) is the componentwise linear least squares base 
procedure described in (4.1), it holds that n~ l YA=i(J^i ' 



g(Xi)) 2 = 

constant. 



C - (U,g), where C = n -1 £?=i U? is a 



3. SOME LOSS FUNCTIONS AND BOOSTING 
ALGORITHMS 

Various boosting algorithms can be defined by 
specifying different (surrogate) loss functions p(-, •). 
The mboost package provides an environment for 
defining loss functions via boost-family objects, as 
exemplified below. 

3.1 Binary Classification 

For binary classification, the response variable is 
Y e {0, 1} with F[Y = 1] =p. Often, it is notation- 
ally more convenient to encode the response by Y = 
2Y — 1 £ {—1, +1} (this coding is used in mboost as 
well) . We consider the negative binomial log-likelihood 
as loss function: 

-(y log(p) + (1 - y) log(l - p)). 

We parametrize p = exp(/)/(exp(/) + exp(— /)) so 
that / = log(p/(l —p))/2 equals half of the log-odds 
ratio; the factor 1/2 is a bit unusual but it will en- 
able that the population minimizer of the loss in 
(3.1) is the same as for the exponential loss in (3.3) 
below. Then, the negative log-likelihood is 

log(l + exp(-2y/)). 



monotone 



non-monotone 
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Fig. 1. Losses, as functions of the margin yf = (2y — l)f, for binary classification. Left panel with monotone loss functions: 
0-1 loss, exponential loss, negative log-likelihood, hinge loss (SVM); right panel with nonmonotone loss functions: squared error 
(L2) and absolute error (L\) as in (3.5). 
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By scaling, we prefer to use the equivalent loss func- 
tion 

(3.1) pio g -iik(y, /) = log 2 (l + exp(-2y/)), 

which then becomes an upper bound of the misclas- 
sification error; see Figure 1. In mboost, the neg- 
ative gradient of this loss function is implemented 
in a function Binomial () returning an object of 
class boost_family which contains the negative gra- 
dient function as a slot (assuming a binary response 
variable y £ {—1, +!})■ 

The population minimizer can be shown to be (cf. 
[33]) 



fio g -hk( x ) = 2 lo £ 



p(x) 



l\X 



l-p(x) 

p(x)=F[Y-- 

The loss function in (3.1) is a function of yf, the 
so-called margin value, where the function / induces 
the following classifier for Y: 

(1, if/(x)>0, 

C(x) = lo, if f(x) < 0, 

[ undetermined, if f(x) = 0. 

Therefore, a misclassification (including the unde- 
termined case) happens if and only if Y f{X) < 0. 
Hence, the misclassification loss is 



(3.2) 



po-x(y,f) = it 



fo-i(x) 



! {y/<o}' 

whose population minimizer is equivalent to the 
Bayes classifier (for Y G {— 1,+1}) 

+1, ifp(x)>l/2, 

-1, ifp(x)<l/2, 

where p(x) = F[Y = l\X = x]. Note that the 0-1 loss 
in (3.2) cannot be used for boosting or FGD: it is 
nondifferentiable and also nonconvex as a function of 
the margin value yf . The negative log-likelihood loss 
in (3.1) can be viewed as a convex upper approxima- 
tion of the (computationally intractable) nonconvex 
0-1 loss; see Figure 1. We will describe in Section 3.3 
the BinomialBoosting algorithm (similar to Logit- 
Boost [33]) which uses the negative log- likelihood as 
loss function (i.e., the surrogate loss which is the 
implementing loss function for the algorithm). 

Another upper convex approximation of the 0-1 
loss function in (3.2) is the exponential loss 

(3.3) Pcx P (y, /) = exp(-y/), 

implemented (with notation y € {—1, +1}) in mboost 
as AdaExp () family. 



The population minimizer can be shown to be the 
same as for the log-likelihood loss (cf. [33]): 



/cx P (^) = 2 lo £ 



P(x) 
1 — p(x) 

p(x) = 



\Y = 1\X 



Using functional gradient descent with different 
(surrogate) loss functions yields different boosting 
algorithms. When using the log-likelihood loss in 
(3.1), we obtain LogitBoost [33] or BinomialBoost- 
ing from Section 3.3; and with the exponential loss 
in (3.3), we essentially get AdaBoost [30] from Sec- 
tion 1.2. 

We interpret the boosting estimate p m '(-) as an 
estimate of the population minimizer /*(•)■ Thus, 
the output from AdaBoost, Logit- or BinomialBoost- 
ing are estimates of half of the log-odds ratio. In 
particular, we define probability estimates via 



pH( s ) 



exp 



(/N(x)) 



exp(/H(x)) +exp(-/M(x)) 



The reason for constructing these probability esti- 
mates is based on the fact that boosting with a 
suitable stopping iteration is consistent [7, 51]. Some 
cautionary remarks about this line of argumentation 
are presented by Mease, Wyner and Buja (author?) 
[64]. 

Very popular in machine learning is the hinge func- 
tion, the standard loss function for support vector 
machines: 

psvM.(y,f) = [i-yf]+, 

where [x]+ =xI{ x> oy denotes the positive part. It is 
also an upper convex bound of the misclassification 
error; see Figure 1. Its population minimizer is 



./< 



SVMl 



sign(p(x) -1/2), 



which is the Bayes classifier for Fg{-1,+1}. Since 
/svm(') ^ s a classifier and noninvertible function of 
p(x), there is no direct way to obtain conditional 
class probability estimates. 

3.2 Regression 

For regression with response F£l, we use most 
often the squared error loss (scaled by the factor 
1/2 such that the negative gradient vector equals 
the residuals; see Section 3.3 below), 



(3.4) 



pL 2 (y,f) 



\\y 



ft 
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with population minimizer 

ft 2 (x)=E[Y\X 



x\ 



Moreover, both the L\- and L2-I0SS functions can 
be parametrized as functions of the margin value 



The corresponding boosting algorithm is Li Boosting; 
see Friedman (author?) [32] and Biihlmann and Yu 
(author?) [22]. It is described in more detail in Sec- 
tion 3.3. This loss function is available in mboost as 
family GaussRegO. 

Alternative loss functions which have some ro- 
bustness properties (with respect to the error dis- 
tribution, i.e., in "Y-space") include the L\- and 
Huber-loss. The former is 

PL 1 (y,f) = \y- f\ 

with population minimizer 

f*(x) = median(Y|X = x) 

and is implemented in mboost as Laplace (). 

Although the Li-loss is not differentiable at the 
point y = f, we can compute partial derivatives since 
the single point y = f (usually) has probability zero 
to be realized by the data. A compromise between 
the L\- and L2-I0SS is the Huber-loss function from 
robust statistics: 

PHubcr(2/,/) 

\y-f\ 2 /2, ti\y-f\<6, 

8(\y-f\-8/2), if \y-f\> 5, 

which is available in mboost as Ruber (). A strat- 
egy for choosing (a changing) 5 adaptively has been 
proposed by Friedman (author?) [32]: 

5 m = median({|Y i - f m ~ 1 \x i )\-i = 1, . . . ,n}), 

where the previous fit p m ~ 1 '(-) is used. 

3.2.1 Connections to binary classification. Moti- 
vated from the population point of view, the L%- or 
Li-loss can also be used for binary classification. For 
Y € {0, 1}, the population minimizers are 

fi 2 (x)=K[Y\X = x] 

= p(x)=F[Y = l\X = x], 
/£ 1 (x) = median(Y|Y = x) 
1, ifp(x)>l/2, 



0, ifp(x)<l/2. 

Thus, the population minimizer of the Li-loss is the 
Bayes classifier. 



(3.5) 



\y 



y-f\ 
f\ 2 



|i-y/l, 

\i-yf? 

(l-2y/ + (y/) 2 ) 



The L\- and L2-I0SS functions are nonmonotone func- 
tions of the margin value yf; see Figure 1. A nega- 
tive aspect is that they penalize margin values which 
are greater than 1: penalizing large margin values 
can be seen as a way to encourage solutions / G 
[—1,1] which is the range of the population mini- 
mizers /£ and /£ (for Y E {— 1,+1}), respectively. 
However, as discussed below, we prefer to use mono- 
tone loss functions. 

The L2-I0SS for classification (with response vari- 
able y E { — 1,+1}) is implemented inGaussClassQ. 

All loss functions mentioned for binary classifica- 
tion (displayed in Figure 1) can be viewed and inter- 
preted from the perspective of proper scoring rules; 
cf. Buja, Stuetzle and Shen (author?) [24]. We usu- 
ally prefer the negative log- likelihood loss in (3.1) 
because: (i) it yields probability estimates; (ii) it is 
a monotone loss function of the margin value yf\ 
(iii) it grows linearly as the margin value yf tends 
to —00, unlike the exponential loss in (3.3). The 
third point reflects a robustness aspect: it is similar 
to Huber's loss function which also penalizes large 
values linearly (instead of quadratically as with the 
Z/2-loss). 

3.3 Two Important Boosting Algorithms 

Table 1 summarizes the most popular loss func- 
tions and their corresponding boosting algorithms. 
We now describe the two algorithms appearing in 
the last two rows of Table 1 in more detail. 

3.3.1 LiBoosting. Z^Boosting is the simplest and 
perhaps most instructive boosting algorithm. It is 
very useful for regression, in particular in presence of 
very many predictor variables. Applying the general 
description of the FGD algorithm from Section 2.1 
to the squared error loss function PL 2 {y->f) = \y ~ 
/| 2 /2, we obtain the following algorithm: 
L2 Boosting algorithm 

1. Initialize p°'(-) with an offset value. The default 
value is /M(-) = Y. Set m = 0. 

2. Increase m by 1. Compute the residuals Ui = Yi — 
flm-i](Xi) for i = l,...,n. 
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3. Fit the residual vector Ui,...,U n to X\, . . . , X n 
by the real- valued base procedure (e.g., regres- 
sion): 

(Y TT\n base procedure A [ m ] M 

4. Update /M (.) = /I™- 1 ! (•) + z/ ■ <? H (•), where < 
v <\ is a step-length factor (as in the general 
FGD algorithm). 

5. Iterate steps 2 to 4 until m = m stop for some stop- 
ping iteration m stop . 

The stopping iteration m st0 p is the main tuning 
parameter which can be selected using cross-valida- 
tion or some information criterion as described in 
Section 5.4. 

The derivation from the generic FGD algorithm 
in Section 2.1 is straightforward. Note that the neg- 
ative gradient vector becomes the residual vector. 
Thus, L2 Boosting amounts to refitting residuals mul- 
tiple times. Tukey (author?) [83] recognized this to 
be useful and proposed "twicing," which is nothing 
else than Z^Boosting using m st0 p = 2 (and v = 1). 

3.3.2 BinomialB 00 sting: the FGD version of Logit- 
Boost. We already gave some reasons at the end of 
Section 3.2.1 why the negative log-likelihood loss 
function in (3.1) is very useful for binary classifi- 
cation problems. Friedman, Hastie and Tibshirani 
(author?) [33] were first in advocating this, and they 
proposed Logit-Boost, which is very similar to the 
generic FGD algorithm when using the loss from 
(3.1): the deviation from FGD is the use of New- 
ton's method involving the Hessian matrix (instead 
of a step-length for the gradient). 

For the sake of coherence with the generic func- 
tional gradient descent algorithm in Section 2.1, we 
describe here a version of LogitBoost; to avoid con- 
flicting terminology, we name it BinomialBoosting: 
BinomialBoosting algorithm 

Apply the generic FGD algorithm from Section 2.1 
using the loss function /0i O g-iik horn (3.1). The de- 
fault offset value is /^(-) = log(p/(l —p))/2, where 
p is the relative frequency of Y = 1 . 



With BinomialBoosting, there is no need that the 
base procedure is able to do weighted fitting; this 
constitutes a slight difference to the requirement for 
Logit-Boost [33]. 

3.4 Other Data Structures and Models 

Due to the generic nature of boosting or func- 
tional gradient descent, we can use the technique in 
very many other settings. For data with univariate 
responses and loss functions which are differentiable 
with respect to the second argument, the boosting 
algorithm is described in Section 2.1. Survival analy- 
sis is an important area of application with censored 
observations; we describe in Section 8 how to deal 
with it. 

4. CHOOSING THE BASE PROCEDURE 

Every boosting algorithm requires the specifica- 
tion of a base procedure. This choice can be driven 
by the aim of optimizing the predictive capacity only 
or by considering some structural properties of the 
boosting estimate in addition. We find the latter 
usually more interesting as it allows for better in- 
terpretation of the resulting model. 

We recall that the generic boosting estimator is a 
sum of base procedure estimates 



/>*](.) = „£>*](.). 



fe=i 

Therefore, structural properties of the boosting func- 
tion estimator are induced by a linear combination 
of structural characteristics of the base procedure. 

The following important examples of base proce- 
dures yield useful structures for the boosting esti- 
mator p m '(-)- The notation is as follows: g(-) is an 
estimate from a base procedure which is based on 
data (Xi,Ui),...,(X n ,U n ) where (Ui,...,U n ) de- 
notes the current negative gradient. In the sequel, 
the jth. component of a vector c will be denoted by 



Table 1 
Various loss functions p(y,f), population minimizers f*(x) and names of corresponding boosting algorithms; 

p(x)=F[Y = l\X = x] 
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4.1 Componentwise Linear Least Squares for 
Linear Models 

Boosting can be very useful for fitting potentially 
high-dimensional generalized linear models. Consider 
the base procedure 

n n 

(4.i) /^ = £x?W£(Af>) 2 , 



X 



(j) 



X> — X .In case of a linear model, when 



«=i 



i=l 



S = argmin V(C/i - (3^x\ j) f. 

It selects the best variable in a simple linear model 
in the sense of ordinary least squares fitting. 

When using Li Boosting with this base procedure, 
we select in every iteration one predictor variable, 
not necessarily a different one for each iteration, and 
we update the function linearly: 

fH( x ) = f [m ~ 1] {x) + up iS ^x iSm \ 

where S m denotes the index of the selected predictor 
variable in iteration m. Alternatively, the update of 
the coefficient estimates is 

The notation should be read that only the <S m th 
component of the coefficient estimate P'-" 1 ' (in iter- 
ation m) has been updated. For every iteration m, 
we obtain a linear model fit. As m tends to infinity, 
P m '(-) converges to a least squares solution which is 
unique if the design matrix has full rank p<n. The 
method is also known as matching pursuit in signal 
processing [60], weak greedy algorithm in computa- 
tional mathematics [81], and it is a Gauss-Southwell 
algorithm [79] for solving a linear system of equa- 
tions. We will discuss more properties of Li Boosting 
with componentwise linear least squares in Section 
5.2. 

When using BinomialBoosting with component- 
wise linear least squares from (4.1), we obtain a fit, 
including variable selection, of a linear logistic re- 
gression model. 

As will be discussed in more detail in Section 5.2, 
boosting typically shrinks the (logistic) regression 
coefficients toward zero. Usually, we do not want to 
shrink the intercept term. In addition, we advocate 
to use boosting on mean centered predictor variables 



centering also the response Y^ = Y^ — Y, this becomes 
p 
y. = ^/jO')A > p ) +noise i 
i=i 
which forces the regression surface through the cen- 
ter (xW , . . . , x^ , y) = (0, 0, . . . , 0) as with ordinary 
least squares. Note that it is not necessary to cen- 
ter the response variables when using the default 
offset value p°> =Y in Li Boosting. [For Binomi- 
alBoosting, we would center the predictor variables 
only but never the response, and we would use p°> = 
axgmin c n _1 £™ =1 p(Yi,c).] 

Illustration: Prediction of total body fat. Garcia et 
al. (author?) [34] report on the development of pre- 
dictive regression equations for body fat content by 
means of p = 9 common anthropometric measure- 
ments which were obtained for n = 71 healthy Ger- 
man women. In addition, the women's body compo- 
sition was measured by dual energy X-ray absorp- 
tiometry (DXA). This reference method is very ac- 
curate in measuring body fat but finds little appli- 
cability in practical environments, mainly because 
of high costs and the methodological efforts needed. 
Therefore, a simple regression equation for predict- 
ing DXA measurements of body fat is of special 
interest for the practitioner. Backward-elimination 
was applied to select important variables from the 
available anthropometrical measurements and Gar- 
cia et al. (author?) [34] report a final linear model 
utilizing hip circumference, knee breadth and a com- 
pound covariate which is defined as the sum of log 
chin skinfold, log triceps skinfold and log subscapu- 
lar skinfold: 
R> bf_lm <- lm(DEXfat ~ hipcirc 

+ kneebreadth 

+ anthro3a, 

data = bodyfat) 
R> coef (bf_lm) 

(Intercept) hipcirc kneebreadth anthro3a 
-75.23478 0.51153 1.90199 8.90964 

A simple regression formula which is easy to com- 
municate, such as a linear combination of only a 
few covariates, is of special interest in this applica- 
tion: we employ the glmboost function from pack- 
age mboost to fit a linear regression model by means 
of ^Boosting with componentwise linear least squares. 
By default, the function glmboost fits a linear model 
(with initial m s top = 100 and shrinkage parameter 
v = 0.1) by minimizing squared error (argument family 
= GaussRegQ is the default): 
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R> bf_glm <- glmboost(DEXfat ~ ., 
data = bodyfat, 
control= boost_control 
(center = TRUE)) 
Note that, by default, the mean of the response 
variable is used as an offset in the first step of the 
boosting algorithm. We center the covariates prior 
to model fitting in addition. As mentioned above, 
the special form of the base learner, that is, compo- 
nentwise linear least squares, allows for a reformu- 
lation of the boosting fit in terms of a linear combi- 
nation of the covariates which can be assessed via 
R> coef (bf_glm) 

(Intercept) age waistcirc hipcirc 

0.000000 0.013602 0.189716 0.351626 
elbowbreadth kneebreadth anthro3a anthro3b 

-0.384140 1.736589 3.326860 3.656524 

anthro3c anthro4 

0.595363 0.000000 
attr(, "offset") 
[1] 30.783 

We notice that most covariates have been used for 
fitting and thus no extensive variable selection was 
performed in the above model. Thus, we need to in- 
vestigate how many boosting iterations are appro- 
priate. Resampling methods such as cross-validation 
or the bootstrap can be used to estimate the out- 
of-sample error for a varying number of boosting it- 
erations. The out-of-bootstrap mean squared error 
for 100 bootstrap samples is depicted in the upper 
part of Figure 2. The plot leads to the impression 
that approximately m stop = 44 would be a sufficient 
number of boosting iterations. In Section 5.4, a cor- 
rected version of the Akaike information criterion 
(AIC) is proposed for determining the optimal num- 
ber of boosting iterations. This criterion attains its 
minimum for 
R> mstop (aic <- AIC(bf _glm)) 

[1] 45 

boosting iterations; see the bottom part of Fig- 
ure 2 in addition. The coefficients of the linear model 
with m stop = 45 boosting iterations are 
R> coef (bf_glm [mstop (aic)] ) 

(Intercept) age waistcirc hipcirc 

0.0000000 0.0023271 0.1893046 0.3488781 

elbowbreadth kneebreadth anthro3a anthro3b 

0.0000000 1.5217686 3.3268603 3.6051548 

anthro3c anthro4 
0.5043133 0.0000000 
attr(, "offset") 
[1] 30.783 

and thus seven covariates have been selected for 
the final model (intercept equal to zero occurs here 
for mean centered response and predictors and hence, 
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Fig. 2. bodyfat data: Out-of-bootstrap squared error for 
varying number of boosting iterations m Btop (top). The dashed 
horizontal line depicts the average out-of-bootstrap error 
of the linear model for the preselected variables hipcirc, 
kneebreadth and anthro3a fitted via ordinary least squares. 
The lower part shows the corrected AIC criterion. 



n~ 1 J27=i Yi = 30.783 is the intercept in the uncen- 
tered model). Note that the variables hipcirc, 
kneebreadth and anthro3a, which we have used for 
fitting a linear model at the beginning of this para- 
graph, have been selected by the boosting algorithm 
as well. 

4.2 Componentwise Smoothing Spline for 
Additive Models 

Additive and generalized additive models, intro- 
duced by Hastie and Tibshirani (author?) [40] (see 
also [41]), have become very popular for adding more 
flexibility to the linear structure in generalized lin- 
ear models. Such flexibility can also be added in 
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boosting (whose framework is especially useful for 
high-dimensional problems). 

We can choose to use a nonparametric base pro- 
cedure for function estimation. Suppose that 

f^'(-) is a least squares cubic smoothing 

, , spline estimate based on Ui,...,U n against 
(4.2) (j\ (j\ 

X\ ,... , Xn with fixed degrees of freedom 

df. 
That is, 



/(■) i=\ 
+ \[(f"(x)) 2 dx, 



(4.3) 



where A > is a tuning parameter such that the 
trace of the corresponding hat matrix equals df. For 
further details, we refer to Green and Silverman (au- 
thor?) [36]. As a note of caution, we use in the sequel 
the terminology of "hat matrix" in a broad sense: it 
is a linear operator but not a projection in general. 
The base procedure is then 

f^'(-) as above and 

n 

S = argmin^T^ - p\X^)f, 
!<i<P i=i 

where the degrees of freedom df are the same for all 

^Boosting with componentwise smoothing splines 
yields an additive model, including variable selec- 
tion, that is, a fit which is additive in the predic- 
tor variables. This can be seen immediately since 
L2 Boosting proceeds additively for updating the func- 
tion p m '(-); see Section 3.3. We can normalize to 
obtain the following additive model estimator: 



/>l(x) = /i + £/M.Ctf(s&*>), 



i=i 



n 



11 

-i£/M,a) (x p) ) = for an j = 1,..., p. 



i=\ 



As with the componentwise linear least squares base 
procedure, we can use componentwise smoothing 
splines also in BinomialBoosting, yielding an addi- 
tive logistic regression fit. 

The degrees of freedom in the smoothing spline 
base procedure should be chosen "small" such as 



df = 4. This yields low variance but typically large 
bias of the base procedure. The bias can then be re- 
duced by additional boosting iterations. This choice 
of low variance but high bias has been analyzed 
in Buhlmann and Yu (author?) [22]; see also Sec- 
tion 4.4. 

Componentwise smoothing splines can be gener- 
alized to pairwise smoothing splines which search 
for and fit the best pairs of predictor variables such 
that smoothing of U\ , . . . , U n against this pair of pre- 
dictors reduces the residual sum of squares most. 
With Z^Boosting, this yields a nonparametric model 
fit with first-order interaction terms. The procedure 
has been empirically demonstrated to be often much 
better than fitting with MARS [23]. 

Illustration: Prediction of total body fat [cont.). 
Being more flexible than the linear model which we 
fitted to the bodyfat data in Section 4.1, we esti- 
mate an additive model using the gamboost func- 
tion from mboost (first with prespecified ?n s top = 
100 boosting iterations, v = 0.1 and squared error 
loss) : 
R> bf_gam 

<- gamboost (DEXf at ~ ., 

data = bodyfat) 

The degrees of freedom in the componentwise 
smoothing spline base procedure can be defined by 
the df base argument, defaulting to 4. 

We can estimate the number of boosting iterations 
m s top using the corrected AIC criterion described in 
Section 5.4 via 
R> mstop(aic <- AIC(bf _gam)) 

[1] 46 

Similarly to the linear regression model, the par- 
tial contributions of the covariates can be extracted 
from the boosting fit. For the most important vari- 
ables, the partial fits are given in Figure 3 showing 
some slight nonlinearity, mainly for kneebreadth. 

4.3 Trees 

In the machine learning community, regression trees 
are the most popular base procedures. They have 
the advantage to be invariant under monotone trans- 
formations of predictor variables, that is, we do not 
need to search for good data transformations. More- 
over, regression trees handle covariates measured at 
different scales (continuous, ordinal or nominal vari- 
ables) in a unified way; unbiased split or variable se- 
lection in the context of different scales is proposed 
in [47]. 
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Fig. 3. body fat data: Partial contributions of four covariates in an additive model (without centering of estimated functions 
to mean zero). 



When using stumps, that is, a tree with two ter- 
minal nodes only, the boosting estimate will be an 
additive model in the original predictor variables, 
because every stump-estimate is a function of a sin- 
gle predictor variable only. Similarly, boosting trees 
with (at most) d terminal nodes result in a nonpara- 
metric model having at most interactions of order 
d — 2. Therefore, if we want to constrain the degree 
of interactions, we can easily do this by constraining 
the (maximal) number of nodes in the base proce- 
dure. 

Illustration: Prediction of total body fat (cont.). 
Both the gbm package [74] and the mboost package 
are helpful when decision trees are to be used as base 
procedures. In mboost, the function blackboost im- 
plements boosting for fitting such classical black-box 
models: 
R> bf_black 

<- blackboost (DEXf at ~ ., 

data = bodyf at , 

control 



= boost_control 
(mstop = 500)) 
Conditional inference trees [47] as available from 
the party package [46] are utilized as base proce- 
dures. Here, the function boost_control defines the 
number of boosting iterations mstop. 

Alternatively, we can use the function gbm from 
the gbm package which yields roughly the same fit 
as can be seen from Figure 4. 

4.4 The Low-Variance Principle 

We have seen above that the structural properties 
of a boosting estimate are determined by the choice 
of a base procedure. In our opinion, the structure 
specification should come first. After having made 
a choice, the question becomes how "complex" the 
base procedure should be. For example, how should 
we choose the degrees of freedom for the componen- 
twise smoothing spline in (4.2)? A general answer 
is: choose the base procedure (having the desired 
structure) with low variance at the price of larger 



BOOSTING ALGORITHMS AND MODEL FITTING 



13 



i 

I 

C 

g 
I 

i 



s - 



a 






o 




I r 

30 40 

Prediction gbm 



Fig. 4. bodyfat data: Fitted values of both the gbm and 
mboost implementations of L2 Boosting with different regres- 
sion trees as base learners. 

estimation bias. For the componentwise smoothing 
splines, this would imply a low number of degrees of 
freedom, for example, df = 4. 

We give some reasons for the low-variance prin- 
ciple in Section 5.1 (Replica 1). Moreover, it has 
been demonstrated in Friedman (author?) [32] that 
a small step-size factor v can be often beneficial 
and almost never yields substantially worse predic- 
tive performance of boosting estimates. Note that 
a small step-size factor can be seen as a shrinkage 
of the base procedure by the factor u, implying low 
variance but potentially large estimation bias. 

5. L 2 BOOSTING 

L2 Boosting is functional gradient descent using 
the squared error loss which amounts to repeated 
fitting of ordinary residuals, as described already in 
Section 3.3.1. Here, we aim at increasing the under- 
standing of the simple Z^Boosting algorithm. We 
first start with a toy problem of curve estimation, 
and we will then illustrate concepts and results which 
are especially useful for high-dimensional data. These 
can serve as heuristics for boosting algorithms with 
other convex loss functions for problems in for ex- 
ample, classification or survival analysis. 

5.1 Nonparametric Curve Estimation: From 
Basics to Asymptotic Optimality 

Consider the toy problem of estimating a regres- 
sion function E[Y|Af = x] with one-dimensional pre- 
dictor lei and a continuous response F£l. 



Consider the case with a linear base procedure 
having a hat matrix Ti : M n — > lR n , mapping the re- 
sponse variables Y = (Yi,...,Y n ) T to their fitted 
values (f(Xi), . . . , f(X n )) T . Examples include non- 
parametric kernel smoothers or smoothing splines. It 
is easy to show that the hat matrix of the Li Boosting 
fit (for simplicity, with p°> = and v = 1) in itera- 
tion m equals 



(5.1) 



B m = B m . 1 + H(I-B m . 1 ) 
= I-(I-H) m . 



Formula (5.1) allows for several insights. First, if 
the base procedure satisfies || I — H\\ < 1 for a suit- 
able norm, that is, has a "learning capacity" such 
that the residual vector is shorter than the input- 
response vector, we see that B m converges to the 
identity I as m — ► 00, and B m Y converges to the 
fully saturated model Y, interpolating the response 
variables exactly. Thus, we see here explicitly that 
we have to stop early with the boosting iterations 
in order to prevent over fitting. 

When specializing to the case of a cubic smoothing 
spline base procedure [cf. (4.3)], it is useful to invoke 
some eigenanalysis. The spectral representation is 



n = udu 



u T u- 

D 



uu 



T 



I. 



diag(Ai,...,A n ), 

where Ai > A2 > • • • > A n denote the (ordered) eigen- 
values of 7i. It then follows with (5.1) that 

UD m U T , 



U"i..n 



diag(di >m 
1 - (1 - x,) m 



■ j "'n.mlt 



It is well known that a smoothing spline satisfies 
Ai = A 2 = l, 0<Ai<l (i = 3,...,n). 

Therefore, the eigenvalues of the boosting hat oper- 
ator (matrix) in iteration m satisfy 



(5.2) 
(5.3) 



di, m = d2,m = 1 for all m, 
0<d hm = l-(l-X l ) m <l (i = 3,...,n), 

di, m ^l (m-voo). 



When comparing the spectrum, that is, the set of 
eigenvalues, of a smoothing spline with its boosted 
version, we have the following. For both cases, the 
largest two eigenvalues are equal to 1. Moreover, all 
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Fig. 5. Mean squared prediction error E[(f(X) — f(X)) ] for the regression model 
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runs. Left: £2 Boosting with smoothing spline base procedure (having fixed degrees of freedom df — 4) and using i/ = 0.1, /or 
varying number of boosting iterations. Right: single smoothing spline with varying degrees of freedom. 



other eigenvalues can be changed either by vary- 
ing the degrees of freedom df = Ya=i ^* m a single 
smoothing spline, or by varying the boosting iter- 
ation m with some fixed (low-variance) smoothing 
spline base procedure having fixed (low) values Aj. 
In Figure 5 we demonstrate the difference between 
the two approaches for changing "complexity" of 
the estimated curve fit by means of a toy example 
first shown in [22]. Both methods have about the 
same minimum mean squared error, but L2 Boosting 
overfits much more slowly than a single smoothing 
spline. 

By careful inspection of the eigenanalysis for this 
simple case of boosting a smoothing spline, Buhlmann 
and Yu (author?) [22] proved an asymptotic mini- 
max rate result: 



(without the need of choosing a higher-order spline 
base procedure). 

Recently, asymptotic convergence and minimax 
rate results have been established for early-stopped 
boosting in more general settings [10, 91]. 

5.1.1 LiBoosting using kernel estimators. As we 
have pointed out in Replica 1, L2 Boosting of smooth- 
ing splines can achieve faster mean squared error 
convergence rates than the classical 0(n -4 ' 5 ), as- 
suming that the true underlying function is suffi- 
ciently smooth. We illustrate here a related phe- 
nomenon with kernel estimators. 

We consider fixed, univariate design points Xi = 
i/n (i = 1, . . . , n) and the Nadaraya-Watson kernel 
estimator for the nonparametric regression function 
E[Y\X = x\: 



Replica 1 ([22]). When stopping the boosting it- 
erations appropriately, that is, m st0 p = rn n = 
(2( n 4 /( 2 £+ 1 )) ) fn n —> oo (ji _> oo) with £ > 2 as be- 
low, L2 Boosting with cubic smoothing splines having 
fixed degrees of freedom achieves the minimax con- 
vergence rate over Sobolev function classes of smooth- 
ness degree £,>2, 



as n 



00. 



Two items are interesting. First, minimax rates 
are achieved by using a base procedure with fixed 
degrees of freedom which means low variance from 
an asymptotic perspective. Second, ^Boosting with 
cubic smoothing splines has the capability to adapt 
to higher-order smoothness of the true underlying 
function; thus, with the stopping iteration as the 
one and only tuning parameter, we can neverthe- 
less adapt to any higher-order degree of smoothness 



g{x-h) = {nhy 1 Y J K 



«=i 






Y t 



n 



-^Khfr - Xi)Yi, 



i=l 



where h > is the bandwidth, K(-) is a kernel in 
the form of a probability density which is symmetric 
around zero and Kh(x) = h~ 1 K(x/h). It is straight- 
forward to derive the form of L2 Boosting using m = 
2 iterations (with fw =0 and v = 1), that is, twicing 
[83], with the Nadaraya-Watson kernel estimator: 

fW{x) = {nhr 1 Y J Kt?{x-x i )Y u 



t=i 



K t h w (u)=2K h (u)-K h *K h (u), 
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where 



K h *K h (u) = n 1 ^2K h (u-x r )K h (x r ). 



r=l 



For fixed design points Xi = i/n, the kernel Kj^(-) 
is asymptotically equivalent to a higher-order kernel 
(which can take negative values) yielding a squared 
bias term of order 0(h 8 ), assuming that the true 
regression function is four times continuously differ- 
entiable. Thus, twicing or /^Boosting with m = 2 
iterations amounts to a Nadaraya-Watson kernel 
estimator with a higher-order kernel. This explains 
from another angle why boosting is able to improve 
the mean squared error rate of the base procedure. 
More details including also nonequispaced designs 
are given in DiMarzio and Taylor (author?) [27]. 

5.2 L 2 Boosting for High-Dimensional Linear 
Models 

Consider a potentially high-dimensional linear mo- 
del 



'P®x® 



+ £i, 



l,...,n, 



(5.4) Yi = (3 + ^ 
i=i 
where ei,...,e n are i.i.d. with E[ej] =0 and inde- 
pendent from all Xj's. We allow for the number of 
predictors p to be much larger than the sample size 
n. The model encompasses the representation of a 
noisy signal by an expansion with an overcomplete 
dictionary of functions {<?'•"(•) -j = 1, •• ■ ,p}', for ex- 
ample, for surface modeling with design points in 
Z { eR 2 , 

Y i = f(Z l )+e i , 

/(z) = V/^V%) (zem. 2 ). 



3 



Fitting the model (5.4) can be done using 
/^Boosting with the componentwise linear least 
squares base procedure from Section 4.1 which fits 
in every iteration the best predictor variable reduc- 
ing the residual sum of squares most. This method 
has the following basic properties: 

1. As the number m of boosting iterations increases, 
the .^Boosting estimate p m '(-) converges to a 
least squares solution. This solution is unique if 
the design matrix has full rank p <n. 

2. When stopping early, which is usually needed to 
avoid overfitting, the Z^Boosting method often 
does variable selection. 

3. The coefficient estimates (3^ m ' are (typically) 
shrunken versions of a least squares estimate /3ols j 
related to the Lasso as described in Section 5.2.1. 



Illustration: Breast cancer subtypes. Variable se- 
lection is especially important in high-dimensional 
situations. As an example, we study a binary classi- 
fication problem involving p = 7129 gene expression 
levels in n = 49 breast cancer tumor samples (data 
taken from [90]). For each sample, a binary response 
variable describes the lymph node status (25 nega- 
tive and 24 positive). 

The data are stored in form of an exprSet object 
westbc (see [35]) and we first extract the matrix of 
expression levels and the response variable: 
R> x <- t (exprs (westbc) ) 
R> y <- pData(westbc)$nodal.y 

We aim at using L 2 Boosting for classification (see 
Section 3.2.1), with classical AIC based on the bi- 
nomial log-likelihood for stopping the boosting it- 
erations. Thus, we first transform the factor y to a 
numeric variable with 0/1 coding: 
R> yfit <- as.numeric(y) - 1 

The general framework implemented in mboost al- 
lows us to specify the negative gradient (the ngradient 
argument) corresponding to the surrogate loss func- 
tion, here the squared error loss implemented as a 
function rho, and a different evaluating loss func- 
tion (the loss argument), here the negative bino- 
mial log- likelihood, with the Family function as fol- 
lows: 
R> rho <- function(y, f, w = 1) { 

p <- pmax(pmin(l - le-05, f ) , 

le-05) 
-y * log(p) - (1 - y) 
* log(l - p) 
} 
R> ngradient 

<- function(y, f, w = 1) y - f 
R> offset 

<- function(y, w) 

weighted. mean (y, w) 
R> L2fm <- Family (ngradient = 
ngradient , 
loss = rho, 
offset = offset) 
The resulting object (called L2fm), bundling the 
negative gradient, the loss function and a function 
for computing an offset term (offset), can now 
be passed to the glmboost function for boosting 
with componentwise linear least squares (here ini- 
tial m st0 p = 200 iterations are used) : 
R> Ctrl <- boost_control 

(mstop = 200, 
center = TRUE) 
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R> west_glm <- glmboost 

(x, yfit, 
family = L2fm, 
control = Ctrl) 
Fitting such a linear model to p = 7129 covariates 
for n = 49 observations takes about 3.6 seconds on 
a medium-scale desktop computer (Intel Pentium 4, 
2.8 GHz). Thus, this form of estimation and vari- 
able selection is computationally very efficient. As 
a comparison, computing all Lasso solutions, using 
package lars [28, 39] in R (with use.Gram=FALSE), 
takes about 6.7 seconds. 

The question how to choose m s t op can be addressed 
by the classical AIC criterion as follows: 
R> aic <- AIC(west_glm, 

method = "classical") 
R> mstop(aic) 

[1] 100 

where the AIC is computed as —2 (log- likelihood) + 
2 (degrees of freedom) = 2 (evaluating loss) + 
2(degrees of freedom); see (5.8). The notion of de- 
grees of freedom is discussed in Section 5.3. 

Figure 6 shows the AIC curve depending on the 
number of boosting iterations. When we stop after 
m stop = 100 boosting iterations, we obtain 33 genes 
with nonzero regression coefficients whose standard- 
ized values f}v'\l Var(Xfa)) are depicted in the left 
panel of Figure 6. 

Of course, we could also use BinomialBoosting for 
analyzing the data; the computational CPU time 
would be of the same order of magnitude, that is, 
only a few seconds. 

5.2.1 Connections to the Lasso. Hastie, Tibshi- 
rani and Friedman (author?) [42] pointed out first 
an intriguing connection between Z^Boosting with 
componentwise linear least squares and the Lasso 
[82] which is the following ^-penalty method: 

n / v \ 2 

/5(A) = argminn" 1 V Y - ft - V ^xf 

f> i=l\ j=l ) 

(5.5) 

+ a£I/3°' ) I- 

Efron et al. (author?) [28] made the connection rig- 
orous and explicit: they considered a version of 
/^Boosting, called forward stagewise linear regres- 
sion (FSLR), and they showed that FSLR with in- 
finitesimally small step-sizes (i.e., the value v in step 



4 of the /^Boosting algorithm in Section 3.3.1) pro- 
duces a set of solutions which is approximately equiv- 
alent to the set of Lasso solutions when varying the 
regularization parameter A in Lasso [see (5.5)]. The 
approximate equivalence is derived by representing 
FSLR and Lasso as two different modifications of 
the computationally efficient least angle regression 
(LARS) algorithm from Efron et al. (author?) [28] 
(see also [68] for generalized linear models). The lat- 
ter is very similar to the algorithm proposed earlier 
by Osborne, Presnell and Turlach (author?) [67]. 
In special cases where the design matrix satisfies a 
"positive cone condition," FSLR, Lasso and LARS 
all coincide ([28], page 425). For more general situ- 
ations, when adding some backward steps to boost- 
ing, such modified ^Boosting coincides with the 
Lasso (Zhao and Yu (author?) [93]). 

Despite the fact that /^Boosting and Lasso are 
not equivalent methods in general, it may be use- 
ful to interpret boosting as being "related" to l 1 - 
penalty based methods. 

5.2.2 Asymptotic consistency in high dimensions. 
We review here a result establishing asymptotic con- 
sistency for very high-dimensional but sparse linear 
models as in (5.4). To capture the notion of high- 
dimensionality, we equip the model with a dimen- 
sionality p = p n which is allowed to grow with sam- 
ple size n; moreover, the coefficients (3^) = /3%' are 
now potentially depending on n and the regression 
function is denoted by f n {')- 

Replica 2 ([18]). Consider the linear model in 
(5.4). Assume that p n = 0(exp(n 1- ^)) for some < 

£ < 1 (high- dimensionality) and sup ne ^J2 P jZ\ \0n \ < 
co (sparseness of the true regression function w.r.t. 

the I 1 -norm); moreover, the variables X\ are 
bounded and E[|ej| 4 '^] < oo. Then: when stopping 
the boosting iterations appropriately, that is, m = 
m n — ► oo (n — > oo) sufficiently slowly, L2Boosting 
with componentwise linear least squares satisfies 

Ex ncw [(^ m " ] (*ncw) " fn(X ncw )) 2 ] - 

in probability (n — > oo), 

where X QCW denotes new predictor variables, inde- 
pendent of and with the same distribution as the 
X -component of the data (Aj, Y) (i = l, ■ ■ ■ ,n). 

The result holds for almost arbitrary designs and 
no assumptions about collinearity or correlations are 
required. Replica 2 identifies boosting as a method 
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Fig. 6. westbc data: Standardized regression coefficients $ y Var(XC?)) (left panel) for m Btop = 100 determined from the 
classical AIC criterion shown in the right panel. 



which is able to consistently estimate a very high- 
dimensional but sparse linear model; for the Lasso in 
(5.5), a similar result holds as well [37]. In terms of 
empirical performance, there seems to be no overall 
superiority of L2 Boosting over Lasso or vice versa. 

5.2.3 Transforming predictor variables. In view of 
Replica 2, we may enrich the design matrix in model 
(5.4) with many transformed predictors: if the true 
regression function can be represented as a sparse 
linear combination of original or transformed pre- 
dictors, consistency is still guaranteed. It should be 
noted, though, that the inclusion of noneffective vari- 
ables in the design matrix does degrade the finite- 
sample performance to a certain extent. 

For example, higher-order interactions can be spec- 
ified in generalized AN(C)OVA models and 
/^Boosting with componentwise linear least squares 
can be used to select a small number out of poten- 
tially many interaction terms. 

As an option for continuously measured covari- 
ates, we may utilize a B-spline basis as illustrated 
in the next paragraph. We emphasize that during 
the process of /^Boosting with componentwise lin- 
ear least squares, individual spline basis functions 
from various predictor variables are selected and fit- 
ted one at a time; in contrast, .^Boosting with com- 
ponentwise smoothing splines fits a whole smoothing 
spline function (for a selected predictor variable) at 
a time. 

Illustration: Prediction of total body fat (cont.). 
Such transformations and estimation of a correspond- 
ing linear model can be done with the glmboost 



function, where the model formula performs the com- 
putations of all transformations by means of the bs 
(B-spline basis) function from the package splines. 
First, we set up a formula transforming each covari- 
ate: 
R> bsfm 

DEXfat " bs(age) + bs(waistcirc) + 
bs(hipcirc) + bs(elbowbreadth) + 
bs (kneebreadth) + bs(anthro3a) + 
bs(anthro3b) + bs(anthro3c) + 
bs(anthro4) 

and then fit the complex linear model by using the 
glmboost function with initial m stop = 5000 boost- 
ing iterations: 
R> Ctrl <- boost_control 

(mstop = 5000) 
R> bf_bs <- glmboost 

(bsfm, data = bodyfat, 
control = Ctrl) 
R> mstop (aic <- AIC(bf_bs)) 

[1] 2891 

The corrected AIC criterion (see Section 5.4) sug- 
gests to stop after 7n s top = 2891 boosting iterations 
and the final model selects 21 (transformed) pre- 
dictor variables. Again, the partial contributions of 
each of the nine original covariates can be com- 
puted easily and are shown in Figure 7 (for the same 
variables as in Figure 3). Note that the depicted 
functional relationship derived from the model fit- 
ted above (Figure 7) is qualitatively the same as the 
one derived from the additive model (Figure 3). 
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Fig. 7. bodyfat data: Partial fits for a linear model fitted to transformed covariates using B-splines (without centering of 
estimated functions to mean zero). 



5.3 Degrees of Freedom for ^Boosting 

A notion of degrees of freedom will be useful for 
estimating the stopping iteration of boosting (Sec- 
tion 5.4). 

5.3.1 Componentwise linear least squares. We con- 
sider ^Boosting with componentwise linear least 
squares. Denote by 

W 0') = x0')(xW) T /||xWf, j = l,...,p, 

the n x n hat matrix for the linear least squares fit- 
ting operator using the jth predictor variable X^') = 



-U) 



(VKJ) yl 

ly 1 ! ) • • • >^-n 



U)\T 



only; 



x T x 



denotes the 

Euclidean norm for a vector x £ W 1 . The hat ma- 
trix of the componentwise linear least squares base 
procedure [see (4.1)] is then 



7i {S) :(U 1 ,...,U n )^U 1 



,Un 



where S is as in (4.1). Similarly to (5.1), we then 
obtain the hat matrix of L2 Boosting in iteration m: 



B„ 



Bm-l + V-H^Xl-Bm-!) 



(5.6) = I - (I - u7i {im) ) 

■{I - vH {Sm - l] ) ■•■(/- vH^), 

where S r S {1, . . . ,p} denotes the component which 
is selected in the componentwise least squares base 
procedure in the rth boosting iteration. We empha- 
size that B m is depending on the response variable 
Y via the selected components <S r , r = 1, . . . , m. Due 
to this dependence on Y, B m should be viewed as an 
approximate hat matrix only. Neglecting the selec- 
tion effect of S r (r = 1, . . . , m), we define the degrees 
of freedom of the boosting fit in iteration m as 

df(m) = trace(B m ). 

Even with v = 1 , df(m) is very different from count- 
ing the number of variables which have been selected 
until iteration m. 

Having some notion of degrees of freedom at hand, 
we can estimate the error variance a^ = E[e|] in the 
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linear model (5.4) by 
1 



cr: 



n - df(m s to P j — ] 
Moreover, we can represent 



E(^-/ [mst ° p] (^)f 



(5.7) 



B m — 2_^i "m > 



i(j) 



where Bm is the (approximate) hat matrix which 
yields the fitted values for the jth predictor, that is, 
B$Y = X^H. Note that the B^'s can be easily 
computed in an iterative way by updating as follows: 

B(f-) = B&l + v ■ H^\l - B m ^), 

B$=B ( £_ X for all j ^S m . 

Thus, we have a decomposition of the total degrees 
of freedom into p terms: 

p 

i=i 

dfk">(m)=trace(B<£). 

The individual degrees of freedom dv^f (rn) are a use- 
ful measure to quantify the "complexity" of the in- 
dividual coefficient estimate /3j . 

5.4 Internal Stopping Criteria for ^Boosting 

Having some degrees of freedom at hand, we can 
now use information criteria for estimating a good 
stopping iteration, without pursuing some sort of 
cross-validation. 

We can use the corrected AIC [49]: 

at/-. / n , r-2\ 1 + df(m)/n 

AIC c (m = log a 2 + — >' , 

(1 — df(m) + 2)/n 

n 

^ = n- 1 £(y i -(B m Y) i ) 2 . 

In mboost, the corrected AIC criterion can be com- 
puted via AIC (x, method = "corrected") (with x 
being an object returned by glmboost or gamboost 
called with family = GaussRegO). Alternatively, 
we may employ the gMDL criterion (Hansen and 
Yu (author?) [38]): 

gMDL(m) = log (S) + ^^ log(F), 

n 



S 



no 



n — df(m) 



,F 



df(m)5 



The gMDL criterion bridges the AIC and BIC in 
a data-driven way: it is an attempt to adaptively 
select the better among the two. 

When using Li Boosting for binary classification 
(see also the end of Section 3.2 and the illustration 
in Section 5.2), we prefer to work with the binomial 
log-likelihood in AIC, 



AIC(m) = -2£*ilog((B m Y) 



i=\ 



(5.8) 



+ (l-Y i )log(l-(B m Y) i 

+ 2df(m), 



or for BIC(m) with the penalty term log(n)df(m). 
(If (B m Y)i £ [0, 1], we truncate by max(min((£> m Y)i, 
1 — 5), 5) for some small 5 > 0, for example, 5 = 
1CT 5 .) 

6. BOOSTING FOR VARIABLE SELECTION 

We address here the question whether boosting 
is a good variable selection scheme. For problems 
with many predictor variables, boosting is compu- 
tationally much more efficient than classical all sub- 
set selection schemes. The mathematical properties 
of boosting for variable selection are still open ques- 
tions, for example, whether it leads to a consistent 
model selection method. 

6.1 L 2 Boosting 

When borrowing from the analogy of .^Boosting 
with the Lasso (see Section 5.2.1), the following is 
relevant. Consider a linear model as in (5.4), al- 
lowing for p> n but being sparse. Then, there is 
a sufficient and "almost" necessary neighborhood 
stability condition (the word "almost" refers to a 
strict inequality "<" whereas "<" suffices for suffi- 
ciency) such that for some suitable penalty param- 
eter A in (5.5), the Lasso finds the true underly- 
ing submodel (the predictor variables with corre- 
sponding regression coefficients ^ 0) with probabil- 
ity tending quickly to 1 as n — > oo [65]. It is im- 
portant to note the role of the sufficient and "al- 
most" necessary condition of the Lasso for model 
selection: Zhao and Yu (author?) [94] call it the 
"irrepresentable condition" which has (mainly) im- 
plications on the "degree of collinearity" of the de- 
sign (predictor variables), and they give examples 
where it holds and where it fails to be true. A fur- 
ther complication is the fact that when tuning the 
Lasso for prediction optimality, that is, choosing the 
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reweighting the penalty function. Instead of (5.5), 
the adaptive Lasso estimator is 



Fig. 8. Hard-threshold (dotted- dashed), soft-threshold (dot- 
ted) and adaptive Lasso (solid) estimator in a linear model 
with orthonormal design. For this design, the adaptive Lasso 
coincides with the nonnegative garrote [13]. The value on the 
x-abscissa, denoted by z, is a single component o/X T Y. 



penalty parameter A in (5.5) such that the mean 
squared error is minimal, the probability for esti- 
mating the true submodel converges to a number 
which is less than 1 or even zero if the problem 
is high-dimensional [65]. In fact, the prediction op- 
timal tuned Lasso selects asymptotically too large 
models. 

The bias of the Lasso mainly causes the difficulties 
mentioned above. We often would like to construct 
estimators which are less biased. It is instructive to 
look at regression with orthonormal design, that is, 
the model (5.4) with £™ =1 x\ j) x\ k) = 5 jk . Then, the 
Lasso and also L2Boosting with componentwise lin- 
ear least squares and using very small v (in step 
4 of L2Boosting; see Section 3.3.1) yield the soft- 
threshold estimator [23, 28]; see Figure 8. It exhibits 
the same amount of bias regardless by how much the 
observation (the variable z in Figure 8) exceeds the 
threshold. This is in contrast to the hard-threshold 
estimator and the adaptive Lasso in (6.1) which are 
much better in terms of bias. 

Nevertheless, the (computationally efficient) Lasso 
seems to be a very useful method for variable filter- 
ing: for many cases, the prediction optimal tuned 
Lasso selects a submodel which contains the true 
model with high probability. A nice proposal to cor- 
rect Lasso's overestimation behavior is the adaptive 
Lasso, given by Zou (author?) [96]. It is based on 



(6.1) 



/3(A) = argminn" 1 £ ( Y, - fo - £ ^X^ 

f 3 i=l V j=l / 



+ x±^ 



j=l I Mini t 



I A 



where /3; n it is an initial estimator, for example, the 
Lasso (from a first stage of Lasso estimation) . Con- 
sistency of the adaptive Lasso for variable selection 
has been proved for the case with fixed predictor- 
dimension p [96] and also for the high-dimensional 
case with p = p n ^> n [48] . 

We do not expect that boosting is free from the 
difficulties which occur when using the Lasso for 
variable selection. The hope is, though, that also 
boosting would produce an interesting set of sub- 
models when varying the number of iterations. 

6.2 Twin Boosting 

Twin Boosting [19] is the boosting analogue to the 
adaptive Lasso. It consists of two stages of boosting: 
the first stage is as usual, and the second stage is 
enforced to resemble the first boosting round. For 
example, if a variable has not been selected in the 
first round of boosting, it will not be selected in 
the second; this property also holds for the adaptive 
Lasso in (6.1), that is, (3^{ t = enforces (3^' = 0. 
Moreover, Twin Boosting with componentwise lin- 
ear least squares is proved to be equivalent to the 
adaptive Lasso for the case of an orthonormal lin- 
ear model and it is empirically shown, in general 
and for various base procedures and models, that it 
has much better variable selection properties than 
the corresponding boosting algorithm [19]. In special 
settings, similar results can be obtained with Sparse 
Boosting [23]; however, Twin Boosting is much more 
generically applicable. 

7. BOOSTING FOR EXPONENTIAL FAMILY 
MODELS 

For exponential family models with general loss 
functions, we can use the generic FGD algorithm as 
described in Section 2.1. 

First, we address the issue about omitting a line 
search between steps 3 and 4 of the generic FGD 
algorithm. Consider the empirical risk at iteration 
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m. 



n 



^p(YJ [m] (X t )) 

n 

(7.1) ^rT 1 $>(*■, /^(X,)) 



i=\ 



vn 



- 1 £[/-#"] (X,), 



j=i 



using a first-order Taylor expansion and the defini- 
tion of Ui. Consider the case with the component- 
wise linear least squares base procedure and without 
loss of generality with standardized predictor vari- 
ables [i.e., n" 1 E?=i(-X'? ) ) 2 = 1 for ah ]]■ Then, 

n 

$ m \x)=n- 1 Y J UiX ! f m) x { - Sm \ 

i=\ 

and the expression in (7.1) becomes 
-ifX^/HpQ)) 



n 



i=l 



(7.2) 



n 



■ ly Ep(YiJ^(Xi)) 



i=\ 



v\ n 



-'EUiXl 



(S m ) 



i=l 



In case of the squared error loss Pi 2 (y,/) = \y 
/| 2 /2, we obtain the exact identity: 



7) 



i=l 



-^pUYiJ^X,)) 

n 

^ 1 Y / PL 2 (Yj [m " l] (X t )) 



t=l 



v{\- V /2)[n- x Y,UiX\ 



(Sm) 



\ i=l / 

Comparing this with (7.2), we see that functional 
gradient descent with a general loss function and 
without additional line-search behaves very similarly 
to /^Boosting (since v is small) with respect 
to optimizing the empirical risk; for 
/^Boosting, the numerical convergence rate is 
n-^UPLMJ^iXi)) = 0(m-V6) (m _ x) 
[81]. This completes our reasoning why the line- 
search in the general functional gradient descent al- 
gorithm can be omitted, of course at the price of 



doing more iterations but not necessarily more com- 
puting time (since the line-search is omitted in every 
iteration) . 

7.1 BinomialBoosting 

For binary classification with Y € {0, 1}, Binomi- 
alBoosting uses the negative binomial log-likelihood 
from (3.1) as loss function. The algorithm is de- 
scribed in Section 3.3.2. Since the population min- 
imizer is f*(x) =\og\p(x)/(l — p(x))]/2, estimates 
from BinomialBoosting are on half of the logit-scale: 
the componentwise linear least squares base proce- 
dure yields a logistic linear model fit while using 
componentwise smoothing splines fits a logistic ad- 
ditive model. Many of the concepts and facts from 
Section 5 about L2 Boosting become useful heuris- 
tics for BinomialBoosting. 

One principal difference is the derivation of the 
boosting hat matrix. Instead of (5.6), a linearization 
argument leads to the following recursion [assuming 
p°'(-) = 0] for an approximate hat matrix B m : 

(m>2), 
WW = dlag(pW(Xi)(l -pH(Xi); 1 < i < n)). 



(7.3) 



A derivation is given in Appendix A. 2. Degrees of 
freedom are then defined as in Section 5.3, 

df(m) = trace(/3 m ), 

and they can be used for information criteria, for 
example, 

n 

AIC(m) = -2j2[Y i log(pW(X i )) 

+ (i-y 4 )iog(i-P H (^))] 

+ 2df(m), 

or for BIC(m) with the penalty term log(n)df(m). 
In mboost, this AIC criterion can be computed via 
AIC(x, method = "classical") (with x being an 
object returned by glmboost or gamboost called 
with family = BinomialQ). 

Illustration: Wisconsin prognostic breast cancer. 
Prediction models for recurrence events in breast 
cancer patients based on covariates which have been 
computed from a digitized image of a fine needle as- 
pirate of breast tissue (those measurements describe 
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characteristics of the cell nuclei present in the im- 
age) have been studied by Street, Mangasarian and 
Wolberg (author?) [80] (the data are part of the UCI 
repository [11]). 

We first analyze these data as a binary prediction 
problem (recurrence vs. nonrecurrence) and later in 
Section 8 by means of survival models. We are faced 
with many covariates (p = 32) for a limited number 
of observations without missing values (n = 194), 
and variable selection is an important issue. We can 
choose a classical logistic regression model via AIC 
in a stepwise algorithm as follows: 
R> cc <- complete. cases (wpbc) 
R> wpbc 2 

<- wpbc[cc, 

colnames(wpbc) != "time"] 
R> wpbc_step 

<- step(glm(status ~ . , 

data = wpbc2, 
family = binomial ()), 
trace = 0) 
The final model consists of 16 parameters with 
R> logLik(wpbc_step) 

'log Lik. ' -80.13 (df=16) 

R> AIC(wpbc_step) 

[1] 192.26 

and we want to compare this model to a logistic re- 
gression model fitted via gradient boosting. We sim- 
ply select the Binomial family [with default offset of 
l/21og(p/(l — p)), where p is the empirical propor- 
tion of recurrences] and we initially use m stop = 500 
boosting iterations: 
R> Ctrl <- boost_control 

(mstop = 500, 
center = TRUE) 
R> wpbc_glm 

<- glmboost (status ~ . , 

data = wpbc2, 
family = Binomial (), 
control = Ctrl) 
The classical AIC criterion (— 21og-likelihood + 2df) 
suggests to stop after 

R> aic <- AIC(wpbc_glm, "classical") 
R> aic 

[1] 199.54 

Optimal number of boosting iterations: 465 

Degrees of freedom (for mstop = 465): 9.147 

boosting iterations. We now restrict the number of 
boosting iterations to m stop = 465 and then obtain 
the estimated coefficients via 
R> wpbc_glm <- wpbc_glm [mstop (aic)] 
R> coef (wpbc_glm) 

[abs(coef (wpbc_glm) ) > 0] 



(Intercept) mean_radius mean_texture 

-1.2511e-01 -5.8453e-03 -2.4505e-02 

mean_smoothness mean_symmetry mean_f ractaldim 

2.8513e+00 -3.9307e+00 -2.8253e+01 

SE_texture SE_perimeter SE_compactness 

-8.7553e-02 5.4917e-02 1 . 1463e+01 

SE_concavity SE_concavepoints SE_symmetry 

-6.9238e+00 -2.0454e+01 5.2125e+00 

SE_f ractaldim worst_radius worst_perimeter 

5.2187e+00 1.3468e-02 1.2108e-03 

worst_area worst_smoothness worst_compactness 

1.8646e-04 9.9560e+00 -1.9469e-01 

tsize pnodes 

4.1561e-02 2.4445e-02 

(Because of using the offset- value p ', we have to 
add the value p°> to the reported intercept estimate 
above for the logistic regression model.) 

A generalized additive model adds more flexibility 
to the regression function but is still interpretable. 
We fit a logistic additive model to the wpbc data as 
follows: 
R> wpbc_gam <- gamboost (status " ., 

data = wpbc2, 

family = Binomial ()) 
R> mopt <- mstop (aic <- 

AIC(wpbc_gam, "classical")) 
R> aic 

[1] 199.76 

Optimal number of boosting iterations: 99 

Degrees of freedom (for mstop = 99): 14.583 

This model selected 16 out of 32 covariates. The 
partial contributions of the four most important vari- 
ables are depicted in Figure 9 indicating a remark- 
able degree of nonlinearity. 

7.2 Poisson Boosting 

For count data with Y G {0, 1,2,...}, we can use 
Poisson regression: we assume that Y\ X = x has a 
Poisson(A(x)) distribution and the goal is to esti- 
mate the function f(x) = log(A(x)). The negative 
log-likelihood yields then the loss function 

P (y,f) = -yf + eMf), / = log(A), 

which can be used in the functional gradient descent 
algorithm in Section 2.1, and it is implemented in 
mboost as PoissonO family. 

Similarly to (7.3), the approximate boosting hat 
matrix is computed by the following recursion: 



(7.4) 






Bm-i + vW [m - 1] H {Sm) (I - B, 



m—l. 



(m>2), 



VF H =diag(A H (A i ); 1<*< 



n). 
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Fig. 9. wpbc data: Partial contributions of four selected covariates in an additive logistic model (without centering of estimated 
functions to mean zero). 



7.3 Initialization of Boosting 

We have briefly described in Sections 2.1 and 4.1 
the issue of choosing an initial value p ' (•) for boost- 
ing. This can be quite important for applications 
where we would like to estimate some parts of a 
model in an unpenalized (nonregularized) fashion, 
with others being subject to regularization. 

For example, we may think of a parametric form of 
/[°1(-), estimated by maximum likelihood, and devi- 
ations from the parametric model would be built in 
by pursuing boosting iterations (with a nonparamet- 
ric base procedure). A concrete example would be: 
p°'(-) is the maximum likelihood estimate in a gen- 
eralized linear model and boosting would be done 
with componentwise smoothing splines to model ad- 
ditive deviations from a generalized linear model. A 
related strategy has been used in [4] for modeling 
multivariate volatility in financial time series. 

Another example would be a linear model Y = 
X/3 + e as in (5.4) where some of the predictor vari- 
ables, say the first q predictor variables X^ 1 ' , . . . , X^ q > , 



enter the estimated linear model in an unpenalized 
way. We propose to do ordinary least squares re- 
gression on X^ 1 ' , . . . , X^ q '\ consider the projection 
P q onto the linear span of X^- 1 ' , . . . , X^ q > and use 
/^Boosting with componentwise linear least squares 
on the new response (/ — P q ) Y and the new (p — q)- 
dimensional predictor (/— P g )X. The final model es- 

timate is then £« =1 $ OLSJ x^ +E P j=q +i /3J mstop] x^, 
where the latter part is from L2Boosting and xv) is 
the residual when linearly regressing x^' to x' 1 ', . . . , 
x( q > . A special case which is used in most applica- 
tions is with q = 1 and X^ 1 ' = 1 encoding for an in- 
tercept. Then, (I-Pi)Y = Y-F and (I-Pi)XW = 

X") — n _1 Ya=i ^i ■ This is exactly the proposal at 
the end of Section 4.1. For generalized linear models, 
analogous concepts can be used. 

8. SURVIVAL ANALYSIS 

The negative gradient of Cox's partial likelihood 
can be used to fit proportional hazards models to 
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censored response variables with boosting algorithms 
[71]. Of course, all types of base procedures can 
be utilized; for example, componentwise linear least 
squares fits a Cox model with a linear predictor. 

Alternatively, we can use the weighted least squares 
framework with weights arising from inverse proba- 
bility censoring. We sketch this approach in the se- 
quel; details are given in [45]. We assume complete 
data of the following form: survival times T{ € M + 
(some of them right-censored) and predictors X{ € 
M p , i = 1, ... ,7i. We transform the survival times to 
the log-scale, but this step is not crucial for what 
follows: Yi = log(Tj). What we observe is 

O i = (Y l ,X i ,A i ), 

Y i = log(f i ), 

fi = min(Tj,Ci), 

where Aj = J(Tj < Cj) is a censoring indicator and 
Ci is the censoring time. Here, we make a restrictive 
assumption that Ci is conditionally independent of 
Ti given Xi (and we assume independence among 
different indices i); this implies that the coarsening 
at random assumption holds [89]. 

We consider the squared error loss for the com- 
plete data, p(y, f) = \y — f\ 2 (without the irrelevant 
factor 1/2). For the observed data, the following 
weighted version turns out to be useful: 



Pobs(o,/) = (y-/) 2 A 



l 



G(t\x) 



G(c\x)=F[C>c\X = x\. 

Thus, the observed data loss function is weighted 
by the inverse probability for censoring AG(t\x) 
(the weights are inverse probabilities of censoring; 
IPC). Under the coarsening at random assumption, 
it then holds that 

E Y>X [(Y - f(X)) 2 ]=E [poUO,f(Xm 

see van der Laan and Robins (author?) [89]. 

The strategy is then to estimate G(-\x), for exam- 
ple, by the Kaplan-Meier estimator, and do weighted 
Li Boosting using the weighted squared error loss: 






1 



G(Ti\Xi 



(iWPQf, 



where the weights are of the form AjG'(Tj|Xj) _1 
(the specification of the estimator G(t\x) may play a 
substantial role in the whole procedure). As demon- 
strated in the previous sections, we can use various 



base procedures as long as they allow for weighted 
least squares fitting. Furthermore, the concepts of 
degrees of freedom and information criteria are anal- 
ogous to Sections 5.3 and 5.4. Details are given in 

[45]. 

Illustration: Wisconsin prognostic breast cancer [cont. 
Instead of the binary response variable describing 
the recurrence status, we make use of the addition- 
ally available time information for modeling the time 
to recurrence; that is, all observations with nonre- 
currence are censored. First, we calculate IPC weights: 
R> censored <- wpbc$status == "R" 
R> iw <- IPCweights(Surv(wpbc$time, 

censored)) 
R> wpbc3 <- wpbc[, names (wpbc) != 
"status"] 
and fit a weighted linear model by boosting with 
componentwise linear weighted least squares as base 
procedure: 
R> Ctrl <- boost_control( 

mstop = 500, center = TRUE) 
R> wpbc_surv <- glmboost( 

log(time) ~ . , data = wpbc3, 
control = ctrl, weights = iw) 
R> mstop (aic <- AIC(wpbc_surv)) 

[1] 122 

R> wpbc_surv <- wpbc_surv[ 

mstop(aic)] 
The following variables have been selected for fit- 
ting: 
R> names (coef (wpbc_surv) 

[abs(coef (wpbc_surv)) > 0] ) 

[1] "mean_radius" "mean_texture" 

[3] "mean_periineter" "mean_smoothness" 

[5] "mean_symnietry" "SE_texture" 

[7] "SE_smoothness" "SE_concavepoints" 

[9] "SE_symmetry" "worst_concavepoints" 

and the fitted values are depicted in Figure 10, 
showing a reasonable model fit. 

Alternatively, a Cox model with linear predictor 
can be fitted using Z^Boosting by implementing the 
negative gradient of the partial likelihood (see [71]) 
via 
R> ctrl <- boost_control 

(center = TRUE) 
R> glmboost 

(Surv(wpbc$time, 

wpbc$status == "N") ~ ., 
data = wpbc, 
family = CoxPHO , 
control = ctrl) 
For more examples, such as fitting an additive Cox 
model using mboost, see [44]. 
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Fig. 10. wpbc data: Fitted values of an IPC-weighted linear 
model, taking both time to recurrence and censoring informa- 
tion into account. The radius of the circles is proportional 
to the IPC weight of the corresponding observation; censored 
observations with IPC weight zero are not plotted. 

9. OTHER WORKS 

We briefly summarize here some other works which 
have not been mentioned in the earlier sections. A 
very different exposition than ours is the overview 
of boosting by Meir and Ratsch (author?) [66]. 

9.1 Methodology and Applications 

Boosting methodology has been used for various 
other statistical models than what we have discussed 
in the previous sections. Models for multivariate re- 
sponses are studied in [20, 59]; some multiclass boost- 
ing methods are discussed in [33, 95]. Other works 
deal with boosting approaches for generalized linear 
and nonparametric models [55, 56, 85, 86], for flex- 
ible semiparametric mixed models [88] or for non- 
parametric models with quality constraints [54, 87]. 
Boosting methods for estimating propensity scores, 
a special weighting scheme for modeling observa- 
tional data, are proposed in [63] . 

There are numerous applications of boosting meth- 
ods to real data problems. We mention here classifi- 
cation of tumor types from gene expressions [25, 26], 
multivariate financial time series [2, 3, 4], text classi- 
fication [78] , document routing [50] or survival anal- 
ysis [8] (different from the approach in Section 8). 

9.2 Asymptotic Theory 

The asymptotic analysis of boosting algorithms 
includes consistency and minimax rate results. The 



first consistency result for AdaBoost has been given 
by Jiang (author?) [51], and a different constructive 
proof with a range for the stopping value m stop = 
m stop,n is given in [7]. Later, Zhang and Yu (au- 
thor?) [92] generalized the results for a functional 
gradient descent with an additional relaxation scheme, 
and their theory covers also more general loss func- 
tions than the exponential loss in AdaBoost. For 
/^Boosting, the first minimax rate result has been 
established by Biihlmann and Yu (author?) [22]. 
This has been extended to much more general set- 
tings by Yao, Rosasco and Caponnetto (author?) 
[91] and Bissantz et al. (author?) [10] . 

In the machine learning community, there has been 
a substantial focus on estimation in the convex hull 
of function classes (cf. [5, 6, 58]). For example, one 
may want to estimate a regression or probability 
function by using 

oo oo 



fc=i 



fe=i 



where the g^ k '(-)'s belong to a function class such 
as stumps or trees with a fixed number of terminal 
nodes. The estimator above is a convex combina- 
tion of individual functions, in contrast to boost- 
ing which pursues a linear combination. By scaling, 
which is necessary in practice and theory (cf. [58]), 
one can actually look at this as a linear combination 
of functions whose coefficients satisfy J2k^k = ^- 
This then represents an ^-constraint as in Lasso, 
a relation which we have already seen from another 
perspective in Section 5.2.1. Consistency of such con- 
vex combination or ^-regularized "boosting" meth- 
ods has been given by Lugosi and Vayatis (author?) 
[58]. Mannor, Meir and Zhang (author?) [61] and 
Blanchard, Lugosi and Vayatis (author?) [12] de- 
rived results for rates of convergence of (versions of) 
convex combination schemes. 

APPENDIX A.l: SOFTWARE 

The data analyses presented in this paper have 
been performed using the mboost add-on package to 
the R system of statistical computing. The theoret- 
ical ingredients of boosting algorithms, such as loss 
functions and their negative gradients, base learn- 
ers and internal stopping criteria, find their com- 
putational counterparts in the mboost package. Its 
implementation and user-interface reflect our statis- 
tical perspective of boosting as a tool for estimation 
in structured models. For example, and extending 
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the reference implementation of tree-based gradient 
boosting from the gbm package [74], mboost allows 
to fit potentially high-dimensional linear or smooth 
additive models, and it has methods to compute de- 
grees of freedom which in turn allow for the use of 
information criteria such as AIC or BIC or for esti- 
mation of variance. Moreover, for high-dimensional 
(generalized) linear models, our implementation is 
very fast to fit models even when the dimension of 
the predictor space is in the ten-thousands. 

The Family function in mboost can be used to 
create an object of class boost-family implementing 
the negative gradient for general surrogate loss func- 
tions. Such an object can later be fed into the fit- 
ting procedure of a linear or additive model which 
optimizes the corresponding empirical risk (an ex- 
ample is given in Section 5.2). Therefore, we are not 
limited to already implemented boosting algorithms, 
but can easily set up our own boosting procedure by 
implementing the negative gradient of the surrogate 
loss function of interest. 

Both the source version as well as binaries for 
several operating systems of the mboost [43] pack- 
age are freely available from the Comprehensive R 
Archive Network (http://CRAN.R-project.org). 
The reader can install our package directly from the 
R prompt via 
R> install. packages ( "mboost " , 

dependencies = 
TRUE) 
R> library ("mboost") 

All analyses presented in this paper are contained 
in a package vignette. The rendered output of the 
analyses is available by the R-command 
R> vignette("mboost_illustrations" , 
package = "mboost") 

whereas the R code for reproducibility of our anal- 
yses can be assessed by 
R> edit (vignette 

("mboost_illustrations" , 
package = "mboost")) 

There are several alternative implementations of 
boosting techniques available as R add-on packages. 
The reference implementation for tree-based gradi- 
ent boosting is gbm [74] . Boosting for additive mod- 
els based on penalized B-splines is implemented in 
GAMBoost [9, 84]. 



APPENDIX A.2: DERIVATION OF BOOSTING 
HAT MATRICES 

Derivation of (7.3). The negative gradient is 





df 



p(y,f) =2(y-p), 



P 



exp(/) 



exp(/)+exp(-/)' 

Next, we linearize p^ m ': we denote p^- m ' = (p^ m '(Xi) 
. . . ,p^ m \X n )) T and analogously for /H. Then, 



P L 



: P 



(A.l) 



[m-l] , dP 
df 



f=r 



Jf [ 



f[m-l]\ 



j[m-i] + 2W^ m ~ 1 ^uH ( - Sm h(Y -pl m - l 1] 



where WH = diag(p(X;)(l-p(JSQ)); 1 <i <n). Since 
for the hat matrix, £> m Y = p'- 771 ' , we obtain from 

(A.l) 

B m « £ m _i + uAW^-^H^(I - Bm-x) (m > 2), 

which shows that (7.3) is approximately true. 

Derivation of formula (7.4). The arguments are 
analogous to those for the binomial case above. Here, 
the negative gradient is 

d 
—QfP(yJ)=y-\ A = exp(/). 

When linearizing A^ = (AH(Xi), . . . , \W(X n )) T 
we get, analogously to (A.l), 



A'' 



A' 



m— 11 



+ 



df 



f=r- 



JfM_f{m-l] ) 



= A!™- 1 ! + w [m ~ ^uH^ m \Y - A [m ~ 1] ), 

where W^ = diag(A(Aj)); 1 < i < n. We then com- 
plete the derivation of (7.4) as in the binomial case 
above. 
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